Genomic analyses of a widespread blueberry virus in the United States

Highlights • Genomic and evolutionary analysis of blueberry virus (BlVL), a new luteovirus.• A cryptic ORF is under strong positive selection.• BlVL is present in almost 80% of the more than 600 samples tested.• Most widespread blueberry virus in the United States.• Given the BlVL diversity movement by propagation material is highly unlikely.


Introduction
Blueberry (Vaccinium spp.) is a perennial crop with plants grown in the field for several decades (Retamales and Hancock, 2018). For this reason, there is a high probability that plants become infected by and accumulate viruses; acting as reservoirs for spread within and across fields. There are currently 18 viruses known to infect the crop (Saad et al., 2021;Villamor et al., 2023) with many causing significant losses wherever they occur (Martin and Tzanetakis, 2018).
Here we describe a new species in the genus Luteovirus infecting blueberry that is phylogenetically related to and shares many of the attributes of the three aforementioned luteoviruses. Seven isolates of the virus, tentatively named Blueberry virus L (BlVL), were sequenced and primers were developed to assess the prevalence of the virus in five U.S. states; where the virus was found to infect almost 80% of all plants tested.

Genome sequencing
Luteovirus-like sequences were first identified in 2017 after screening blueberry accessions for viruses using high throughput sequencing (HTS, Villamor et al., 2019). For further characterization, total RNA of several blueberries was submitted for HTS using Illumina HiSeq2500. VirFind (Ho and Tzanetakis, 2014) identified twelve luteovirus-like contigs in sample 2017-FRV-NGS41 collected in Oregon. Preliminary mapping of these contigs to the NSPaV genome identified regions of the genome not covered by the contigs. cDNA was synthesized with Maxima H Minus Reverse Transcriptase (Thermo Scientific) with random hexamer primers as per manufacturer's protocol. PCR primers which spanned the gaps between the contigs were designed (Supplementary Table 1) and these regions were amplified with Genescript Taq DNA polymerase and Sanger sequenced. The contigs and PCR amplicons resulted in a 4223 nucleotide (nt) long sequence missing the genome ends. All HTS reads were remapped to this sequence with an average of 14x coverage. The 5 ′ terminus was determined using the 5 ′ RACE System, (version 2, Thermo Fisher Scientific) whereas for the 3 ′ terminus the Poly(A)-tailing kit (Thermo Fisher Scientific) was used as per the manufacturer's instructions (primer sequences available in Supplementary Table 1). Amplicons of each terminus were cloned into pCR 2.1 TOPO vector (Thermo Fisher Scientific) and at least five clones for each region were sequenced.
An additional six near full genomes were amplified from Oregon samples (Supplementary Table 2). Reverse transcription was carried out with primer BlVL-R and Maxima H Minus Reverse Transcriptase (Thermo Scientific) as per manufacturers protocol with the optional 65 • C incubation step. Followed by PCR with Kapa HiFi polymerase and associated Fidelity buffer (Kapa Biosystems KK2102, protocol 95 • C 3min (98 • C 20 s, 70 • C 15 s , 72 • C 4. 5 min) 40x 72 • C 8 min) using primers which bind within the 5 ′ and 3 ′ UTRs (Supplementary Table 1). Single plasmid clones were sequenced at Plasmidsaurus, using Oxford Nanopore Technology (www.plasmidsaurus.com) with coverage that exceeded 700x. All isolates were analyzed with Geneious Prime 2022.1.1 (https://www.geneious.com).

Genome analyses and phylogenetics
Analyses included isolates of 19 genomes, with REFSEQ isolates used where available; the 13 luteoviruses accepted by ICTV and six putative unclassified members of the genus: almond luteovirus 1 (AlLV1), peach associated luteovirus (PaLV) and four Prunus-infecting species (Khalili et al., 2023). NCBI ORFfinder and alignments to other luteoviruses was used to identify BlVL ORFs. Sequence Demarcation Tool v1.2 (SDT)  with the MUSCLE alignment option, was used to calculate the pairwise nucleotide (nt) and amino acid (aa) identity of the genome and proteins as well as a dataset containing the eight conserved polymerase motifs present across the positive-strand RNA viruses (Koonin, 1991). Each dataset was aligned with MUSCLE, tested for the best-fit model, and maximum likelihood phylogenetic trees with 1000 bootstrap replicates were constructed within MegaX (Kumar et al., 2018). Branches with low bootstrap values (<60%) were collapsed in TreeGraph2 and unrooted trees were visualized in FigTree v1.4.4 (Stöver and Müller, 2010), (http://tree.bio.ed.ac.uk/software/ figtree/). The six near-full BlVL genomes were aligned with the reference sequence, with genome trees and percentage pairwise identities calculated as above.
ORF3a is present in the majority of luteoviruses, yet it is difficult to identify due to alternative start codons and its small size. Therefore, an alignment-based method was used in MegaX, with the sequence region between ORF2 and the start of ORF3, for all reference species, in all (+) frames with known P3a used as reference.
Other ORFs were further analyzed for evidence of transmembrane helices (TMH). All identifiable ORFs of apple luteovirus 1 (ALV1), AlLV1, BlVL, NSPaV, Prunus mahaleb-associated luteovirus (PmaLV) and RCaV greater than 75 nt including those with alternative start codons (as identified in ORFfinder), were uploaded into TMHMM -2.0 (Krogh et al., 2001). In addition, datasets of the putative TMH ORFs from 18 NSPaV and the seven BlVL isolates were compiled. For the NSPaV and BlVL datasets, the TMH sequence had gaps incorporated at the start and end to anchor the smaller TMH ORF relative to the corresponding expressed ORF. Sequences were analyzed for evolutionary selection using FUBAR within SelectionMap v1.0 (Stenzel et al., 2014). SelectionMap allows for each ORF to be analyzed for selection individually while providing a visual map of selection sites. RNA fold (Gruber et al., 2008;Lorenz et al., 2011) was used to identify the cap-independent translation elements in all BlVL genomes.  (Table 2). Samples were collected randomly across fields and symptom information is largely unavailable. Total nucleic acids were extracted from leaf tissue as per Poudel et al. (2013) and stored at − 80 • C. As samples were tested for BlVL and other blueberry viruses (data not shown), reverse transcription was carried out using both dT and random hexamer primers as described in Thekke Veetil et al. (2016) and screened with the NADH dehydrogenase ND-2 β subunit primers to evaluate the quality of cDNA (Tzanetakis et al., 2007). Screening primers BlVL3692F (5 ′ -TGGATGGCGCGTGAGTATTC-3 ′ )/BlVL4473R (5 ′ -CCTTTCGGTAGCAACAATACC-3 ′ ) were developed and amplify a 782 nt region close to the 3 ′ end of the genome. The reaction consisted of 400 µM dNTPs, 400 nM of each of the screening primers, 10X Taq buffer, 0.5 U Taq DNA polymerase (Genscript, NJ) and molecular grade water to 25μl. PCR protocol of 94 • C 2min (94 • C 20 s , 70 • C 20 s ,72 • C 40 s ) 40cycles 72 • C 8 min . Bidirectional Sanger sequencing of a subset of these positive samples was carried out. For each sample, the reads were trimmed and assembled into consensus sequences with Geneious Prime (2022.1.1) recommended parameters (0.05 error probability trim, consensus threshold highest quality 60%). The BlVL3692F/BlVL4473R sequences were further trimmed within the primer sites. The seven full genome isolates were trimmed to the same region and the percentage pairwise identities for all 286 samples were calculated in SDT as above.

BlVL genome architecture
The reference isolate of BlVL was sequenced from 2017-FRV-NGS41 and deposited under GenBank accession OQ686746. An additional six near full genomes were sequenced and deposited in GenBank under accessions OQ686747-OQ686752 (Supplementary Table 2).
Initiation of luteovirus translation occurs through the   Shen and Miller, 2004). The helicase motifs I/Ia, II-IV, and VI, have been previously identified in the luteoviruses barley yellow dwarf virus-PAV BYDVPAV and soybean dwarf virus (Habili and Symons, 1989). Motifs I/Ia-III are located in the P1 region of the replicase and IV and VI embedded within P2. The majority of the luteovirus species analyzed here, including BlVL do not include identifiable P1 motifs as there is little aa sequence consensus in the alignments in this region. All luteovirus species contain the P2 helicase motifs IV and VI; motif IV is highly conserved across all luteovirus species with BlVL present as RRVFTVGKG 396-404 whereas motif VI is present in BlVL as RYNVEVGRRLKFNEKK 500-515 .
Across all luteoviruses, the replicase share 49.7-97.5% aa identity, with the highest identities seen amongst BYDV species. Due to low similarities of the replicase across the majority of the species, a smaller conserved region of the replicase was analyzed This region spans motif I-VIII, motifs which were previously identified as conserved across the positive strand viruses (Koonin, 1991

Phylogenetics
A maximum likelihood (ML) full genome nucleotide tree (model -General Time Reversable model with rates among sites Gamma distributed with Invarient sites (GTR+GI)) shows the clustering of the BlVL relative to other luteoviruses (Fig. 3). Full genomes of the luteoviruses fall into four clades, with the BlVL clustering with RCaV, AlLV1 and NSPaV. As the luteovirus classification within the Tombusviridae is based on the replicase, a ML phylogenetic tree (model -Le Gascuel model with rates among sites Gamma distributed (LG+G)) of the fusion protein was also constructed (Supplementary Fig. 1), which shows similar clustering.

Putative proteins with transmembrane helicase signals
Three of the four viruses without identifiable movement proteins, AlLV1, BlVL and NSPaV showed evidence of putative proteins containing transmembrane helicase (TMH) signals (six, three, three ORFs respectively). All three viruses contain a putative TMH ORF in a similar location, embedded within ORF 5; BlVL (99aa), NSPaV (86aa) and AlLV1 with two ORFs (80aa and 55aa). The previously characterized ALV1 contains additional ORFs, one of which is also in the ORF5 nucleotide region (Liu et al., 2018) (Fig. 1), however this ORF did not show evidence of TMH. PmaLV which encodes P4 but not P3a, also contains a putative ORF with a TMH signal embedded at the start of ORF 5. As both AlLV1 and PmaLV have only one isolate available no further analysis was undertaken.

Selection analyses of putative TMH ORFs and P5 region
Both BlVL and NSPaV putative TMH proteins show multiple sites are evolving under positive selection with the average dN/dS (non-synonymous/synonymous substitutions) showing overall positive selection for BlVL the average dN/dS=3.61 and for NSPaV the average dN/dS=2.67 (Fig. 4). The P5 region however shows overall negative selection for both BlVL and NSPaV, with average dN/dS=0.46 and dN/dS=0.51 respectively. Comparison on the approximate locations of the TMH region over the corresponding P5 region, shows differences in the two species. For BlVL (Fig. 4), there are high numbers of sites evolving under negative selection in the P5 region located in the corresponding TMH region. Whereas with NSPaV the sites evolving under negative selection are more uniformly distributed across P5 (Fig. 4).
Additional UGACA sequences, potential BCL sequences, are present within and after the TMH ORF, at nt 3817 (seven isolates) and nt 4177 (six isolates), respectively. However, neither of these are predicted to fold into stem-loop structures. ORF4 (when present) is translated via leaky scanning (Dinesh- Kumar and Miller, 1993) yet the distance of the TMH ORF from ORF3 start codon is likely too distant (887 nt).
BlVL Prevalance BlVL was found in >50% of samples in each of the localities. Samples collected in Pennsylvania, New Jersey and Washington had an incidence of 54%, 75% and 84% respectively. Both Oregon and Michigan show especially high incidence with 90% in Oregon and 93% in Michigan (Table 2). Sanger sequencing of the positive samples resulted in 279 sequences ranging from 612 to 743 nt in length (GenBank OQ709486-OQ709764). These isolates along with the same region from the seven BlVL isolates showed significant sequence diversity, with percentage pairwise identities of 84.8-100%, with the lowest identity seen between sequences from New Jersey (NJ) and Michigan (MI) (NJ1_H11 to MI2_E4) and (MI2_E4 to NJ2_H3) (full matrix in Supplementary Data 1). Cultivars that tested positive for BlVL include: Berkley, Bluecrop, Blueray, Bluetta, Collins, Darrow, Draper, Duke, Earliblue, Elliott, Herbert, Jersey, Liberty, Nelson, Old Jersey, and Toro.

Discussion
Here we describe blueberry virus L, a novel luteovirus. The BlVL genome architecture is similar to AlLV1, NSPaV and RCaV; all having four canonical ORFs and without identifiable MPs. The ICTV species demarcation for luteoviruses is that of a virus where any protein shows >90% aa identity (King et al., 2011); with all BlVL ORFs being well below this threshold when compared to the other luteovirus species.
The seven sequenced BlVL isolates show 93.1-99.2% nt identity yet three share ~89% aa identity in P1, when compared to the others. While this is lower than the ICTV species demarcation, the rest of the proteins share >93% aa identity. A recent update of the Betaflexiviridae taxonomic criteria has been proposed; where in instances that the aa identity of one protein is close to the cutoff threshold, (which in BlVL case the lowest aa identity is 88.5% with a cutoff value of 90%) then the aa identity of an additional protein is considered (Silva et al., 2022). Therefore, taking into consideration all other aa% identities of BlVL, all isolates presented here are the same species.
When comparing the BlVL genome to other luteoviruses, BlVL has the highest percentage pairwise identity to AlLV1, NSPaV and RCaV, yet the encoded proteins do not always show highest identities to these three viruses (Table 1). P1 has highest pairwise identity to AaLV, and the replicase (P1-P2) and capsid protein share equal percentage pairwise identity with RCaV and AaLV, and AlLV1 and peach associated luteovirus (PaLV) respectively. Phylogenetically, neither AaLV or PaLV full genomes cluster near the BlVL clade suggesting recombination may have occurred across these (or ancestral viruses) during the evolution of BlVL, with evidence of recombination events identified in other luteoviruses (Pagán and Holmes, 2010). Another hypothesis would be asymmetrical evolution based on the hosts the viruses infect at the time.
Based on the high prevalence of BlVL and a survey of Prunus species that identified NSPaV in higher incidence than other prunus-infecting luteoviruses (Khalili et al., 2023), suggests that the lack of identifiable movement-associated proteins (P3a, P4) does not have a major impact on virus prevalence. We have identified a putative ORF which contains evidence of transmembrane helices, which are indicative of a protein involved in movement (Krogh et al., 2001;Morozov and Solovyev, 2020). Yet there is no identifiable mechanism on how this ORF is translated based on current knowledge of the luteovirus expression strategies. Further complexity is added when considering the selection analyses. Seemingly unfavorable, asymmetric evolution of expressed viral genes is not unusual, with a study finding half of the viruses tested were evolving asymmetrically (Pavesi, 2019). So, while it is unclear how or if the identified putative ORFs are expressed, the fact that isolates of both NSPaV and BlVL contain a putative ORF with evidence of TMH in a similar genome region and both show high positive selection while being constrained by negative selection of the overlapping ORF5, warrants further investigation.
BlVL was found in high prevalence in survey samples collected across five US states suggesting that either this virus has been circulating undetected for a long period of time, is moving through propagation material, and/or it spreads rapidly in the field. The movement of propagation material is always of concern yet the high diversity of the 286 samples sequenced suggests multiple introductions, rather than the movement of clonally propagated material. A previous study of blueberry viruses showed regional differences in the most prevalent species, with the incidence of the most prevalent virus varying from 5%− 45% in each region and the total incidence across the US of a single species at 21% (Martin and Tzanetakis, 2018). Comparatively, BlVL has higher incidence per state and total incidence in the US making it the most widespread blueberry virus in the US to date.
The high diversity and wide distribution of BlVL in commercial plantings suggests a long association of this virus, with US blueberry cultivars first released in the 1920s with many early cultivars still used today (Mainland, 2012). It also suggests that this virus does not cause significant symptoms in blueberry in single infections since it has not been identified prior to HTS analysis of blueberry. However, there is a high likelihood that these samples contain mixed infections with BlVL as these samples have previously been tested for a number of other viruses (Martin and Tzanetakis, 2018). It is unknown what impact mixed infections have on blueberry but as a long-cropped plant, the presence and accumulation of multiple viruses is a concern. Further studies into movement, symptomology and mixed infections of BlVL are needed to better understand this newly characterized virus.

Funding
This work was partially funded by the USDA under agreements AP20PPQS&TOOC170, AP21PPQS&T00C011 and USDA-NIFA Hatch project 1002361.

Author statement
The authors declare that they do not have any conflicts of interest. Supplementary Data 2: Graphical output of evidence of the presence of transmembrane helices for each isolate detected with the TMHMM-2.0 program.

Declaration of Competing Interest
The authors declare that they do not have any conflicts of interests associated with the research presented here

Data availability
Data will be made available on request.

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.virusres.2023.199143.